Contents

Loading the data

In this demo, we use CyTOF data of PBMC samples from 4 patients, each containing between 500 and 1000 cells. For each sample, the expression of 10 cell surface and 14 signaling markers was measured before (REF) and upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30’, resulting in a total of 8 samples.

This data is stored as a flowSet object PBMC_fs, where each sample is stored in a separate flowFrame:

library(CATALYST)
## Warning: package 'S4Vectors' was built under R version 4.1.3
data(PBMC_fs)
PBMC_fs
## A flowSet with 8 experiments.
## 
## column names(24): CD3(110:114)Dd CD45(In115)Dd ... HLA-DR(Yb174)Dd
##   CD7(Yb176)Dd

Additionally, we load two metadaa tables: PBMC_md and PBMC_panel that contain metadata on each sample and marker, respectively:

data(PBMC_md, PBMC_panel)
PBMC_md
##                 file_name sample_id condition patient_id
## 1 PBMC_patient1_BCRXL.fcs    BCRXL1     BCRXL   Patient1
## 2   PBMC_patient1_Ref.fcs      Ref1       Ref   Patient1
## 3 PBMC_patient2_BCRXL.fcs    BCRXL2     BCRXL   Patient2
## 4   PBMC_patient2_Ref.fcs      Ref2       Ref   Patient2
## 5 PBMC_patient3_BCRXL.fcs    BCRXL3     BCRXL   Patient3
## 6   PBMC_patient3_Ref.fcs      Ref3       Ref   Patient3
## 7 PBMC_patient4_BCRXL.fcs    BCRXL4     BCRXL   Patient4
## 8   PBMC_patient4_Ref.fcs      Ref4       Ref   Patient4
head(PBMC_panel)
##      fcs_colname antigen marker_class
## 1 CD3(110:114)Dd     CD3         type
## 2  CD45(In115)Dd    CD45         type
## 3 pNFkB(Nd142)Dd   pNFkB        state
## 4  pp38(Nd144)Dd    pp38        state
## 5   CD4(Nd145)Dd     CD4         type
## 6  CD20(Sm147)Dd    CD20         type

Constructing a SingleCellExperiment

From the three data objects above, we construct an object of the SingleCellExperiment (SCE) class, which extends the SummarizedExperiment class with specialized methods for single-cell data:

sce <- prepData(PBMC_fs, PBMC_panel, PBMC_md)

In a SCE, assays contain matrix-like measurement data (e.g., transcript counts
in scRNA-seq, ion counts in CyTOF, fluorescent intensities in FACS), where rows = features (transcripts, genes, proteins) and columns = observations (single cells, samples):

# our SCE contains ion counts & 
# expression-like transformed counts
assayNames(sce)
## [1] "counts" "exprs"
# we have 24 markers & about 5.5k cells
dim(sce)
## [1]   24 5428
rownames(sce)
##  [1] "CD3"    "CD45"   "pNFkB"  "pp38"   "CD4"    "CD20"   "CD33"   "pStat5"
##  [9] "CD123"  "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3" "CD14"   "pSlp76"
## [17] "pBtk"   "pPlcg2" "pErk"   "pLat"   "IgM"    "pS6"    "HLA-DR" "CD7"

Two DataFrame objects, row/colData, accompany rows/columns and store feature/observation metadata:

head(rowData(sce))
## DataFrame with 6 rows and 3 columns
##         channel_name marker_name marker_class
##          <character> <character>     <factor>
## CD3   CD3(110:114)Dd         CD3        type 
## CD45   CD45(In115)Dd        CD45        type 
## pNFkB pNFkB(Nd142)Dd       pNFkB        state
## pp38   pp38(Nd144)Dd        pp38        state
## CD4     CD4(Nd145)Dd         CD4        type 
## CD20   CD20(Sm147)Dd        CD20        type
head(colData(sce))
## DataFrame with 6 rows and 3 columns
##   sample_id condition patient_id
##    <factor>  <factor>   <factor>
## 1    BCRXL1     BCRXL   Patient1
## 2    BCRXL1     BCRXL   Patient1
## 3    BCRXL1     BCRXL   Patient1
## 4    BCRXL1     BCRXL   Patient1
## 5    BCRXL1     BCRXL   Patient1
## 6    BCRXL1     BCRXL   Patient1

Additionally, the metadata slot stores a list of experiment-wide metadata:

names(md <- metadata(sce))
## [1] "experiment_info"
head(md$experiment_info)
##   sample_id condition patient_id n_cells
## 1    BCRXL1     BCRXL   Patient1     528
## 2      Ref1       Ref   Patient1     881
## 3    BCRXL2     BCRXL   Patient2     665
## 4      Ref2       Ref   Patient2     438
## 5    BCRXL3     BCRXL   Patient3     563
## 6      Ref3       Ref   Patient3     660

We can quickly visualize the distribution of cells across the 8 samples:

plotCounts(sce, group_by = "sample_id", color_by = "patient_id")

1 Clustering

Next, we use the cluster() function to perform FlowSOM clustering and ConsensusClusterPlus metaclustering into xdim x ydim clusters, and 2 through maxK metaclusters:

sce <- cluster(sce, xdim = 10, ydim = 10, maxK = 20, verbose = FALSE)

All clusterings (resolutions 100, 2, 3, …, 20) are now available in the SCE. We can acess the cluster assignments for a specific clustering k via cluster_ids(., k), or view a table of all available clusterings with cluster_codes(.):

# view available clusterings
names(codes <- cluster_codes(sce))
##  [1] "som100" "meta2"  "meta3"  "meta4"  "meta5"  "meta6"  "meta7"  "meta8" 
##  [9] "meta9"  "meta10" "meta11" "meta12" "meta13" "meta14" "meta15" "meta16"
## [17] "meta17" "meta18" "meta19" "meta20"
# tabulate cells for a specific resolution
table(cluster_ids(sce, "meta8"))
## 
##    1    2    3    4    5    6    7    8 
##  708 1195 1293  449  537  605  297  344

A neat way to visualize how the different resolutions are hierarchically linked is with the clustree package:

library(clustree)
clustree(codes, prefix = "meta")

2 Type- & state-markers

The methodology presented here is based on a dichotomy of markers into “type” markers (used for clustering cells into subpopulations) and “state” markers (tested for differences in expression across conditions). The set of type and state markers defined in the SCE can be viewed via type/state_markers(.):

type_markers(sce)
##  [1] "CD3"    "CD45"   "CD4"    "CD20"   "CD33"   "CD123"  "CD14"   "IgM"   
##  [9] "HLA-DR" "CD7"
state_markers(sce)
##  [1] "pNFkB"  "pp38"   "pStat5" "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3"
##  [9] "pSlp76" "pBtk"   "pPlcg2" "pErk"   "pLat"   "pS6"

By default, cluster() has used the type markers to group cells into subpopulations. Thus, we could use the expression of these markers to assign biologically meaningful labels to our clusters:

plotExprHeatmap(sce, features = "type", by = "cluster_id")

3 Dimensionality reduction

One of the most popular ways to visualize single cell data is via dimensionality reduction (DR), where each cell is projected into a lower, usually two-dimensional, space. Here, we compute a nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP), using the runDR() function:

sce <- runDR(sce, dr = "UMAP")

We can visualize the above two-dimensional embedding of cells using plotDR() and, for example, color cells by their cluster assignment, patient ID, and experimental treatment group:

plotDR(sce, color_by = "meta12")

plotDR(sce, color_by = "patient_id")

plotDR(sce, color_by = "condition")

We can also color cells by the expression of specific markers. And furthermore split the plot by another variable of interest, say treatment:

plotDR(sce, color_by = c("pS6", "pBtk", "pp38"), facet_by = "condition")

4 Pseudobulk-level MDS plot

pbMDS() renders a multi-dimensional scaling (MDS) plot on median expression values. Such a plot will give a sense of similarities between samples in an unsupervised way and of key difference in expression before conducting any formal testing:

pbMDS(sce, by = "sample_id")

pbMDS(sce, by = "both")

5 Differential analysis

The diffcyt package implements methods for differential discover
in high-dimensional cytometry data - differential state (DS) analysis to test for subpopulation-specific changes in expression across experimental conditions - differential abundance (DA) analysis to test changes in relative abundance (frequency) of subpopulations across conditions

5.1 DS analysis

Without conducting any formal testing, we can visualize the expression of our 14 state markers (not used for clustering) across all samples to see whether there are visible changes across conditions:

plotMultiHeatmap(sce, hm1 = FALSE, scale = "never",
    hm2 = state_markers(sce)[1:7])

plotMultiHeatmap(sce, hm1 = FALSE, scale = "never",
    hm2 = state_markers(sce)[8:14])

To formally identify subpopulation-specific changes in expression across conditions, we first set up a model formula and contrast matrix, and then use the diffcyt() function for statistical testing:

library(diffcyt)
ei <- md$experiment_info
design <- createDesignMatrix(ei, "condition")
contrast <- createContrast(c(0, 1))

ds <- diffcyt(sce, ei, analysis_type = "DS", method_DS = "diffcyt-DS-limma",
    design = design, contrast = contrast, clustering_to_use = "meta20")

diffcyt() returns a list of various objects. Differential testing results are stored in the first element’s rowData:

head(tbl <- rowData(ds$res))
## DataFrame with 6 rows and 9 columns
##   cluster_id marker_id          ID     logFC   AveExpr         t       p_val
##     <factor>  <factor> <character> <numeric> <numeric> <numeric>   <numeric>
## 1          1     pNFkB           1  -1.47685   1.79315  -4.40364 0.001460778
## 2          2     pNFkB           2  -1.48163   1.54605  -5.81654 0.000197358
## 3          3     pNFkB           3  -1.02816   1.81292  -3.99371 0.002753439
## 4          4     pNFkB           4  -2.14277   2.42447  -5.55022 0.000281649
## 5          5     pNFkB           5  -2.26421   1.81401  -4.76644 0.000850179
## 6          6     pNFkB           6  -1.45570   1.75282  -5.09587 0.000528678
##        p_adj         B
##    <numeric> <numeric>
## 1 0.01573146 -1.062223
## 2 0.00451360  0.892961
## 3 0.02513515 -2.114418
## 4 0.00525744  0.785925
## 5 0.01159953 -0.276018
## 6 0.00870764 -0.326383

Having extracted the above table, we can easily tabulate the number of marker-cluster combinations that are significant at a given FDR cutoff, say 5%:

fdr <- 0.05
table(tbl$p_adj < 0.05)
## 
## FALSE  TRUE 
##   242    38

Lastly, we can visualize the top differential findings across all clusters:

plotDiffHeatmap(sce, tbl, top_n = 20)

5.2 DA analysis

So far, we’ve mostly looked at expression of markers across different samples, conditions and clusters. We could also investigate the relative abundance (frequency) of subpopulations across samples and treatment groups:

plotAbundances(sce, by = "sample_id")

plotAbundances(sce, by = "cluster_id")

Similar to the plots generated above, we can render a heatmap representation of these frequencies:

plotFreqHeatmap(sce)